Skip to content

refactor(sandbox): move sandbox state management server-side (ENG-4287)#381

Merged
huv1k merged 1 commit into
mainfrom
move-sandbox-state-management-server-side-eng-4287
Jun 25, 2026
Merged

refactor(sandbox): move sandbox state management server-side (ENG-4287)#381
huv1k merged 1 commit into
mainfrom
move-sandbox-state-management-server-side-eng-4287

Conversation

@huv1k

@huv1k huv1k commented Jun 10, 2026

Copy link
Copy Markdown
Member

Moves sandbox state management server-side so the user's account-level access token is never exposed to browser JS (ENG-4287), and adds the ability to pause and resume a sandbox from the dashboard. Previously the sandbox-details surfaces drove the E2B SDK client-side with sandboxManagementAuth (the account token); now the browser only ever receives sandbox-scoped envd credentials.

Server-side state management

  • Filesystem inspector builds an envd-only SDK client via a shared createEnvdSandbox helper from the envdAccessToken already returned by the sandbox.details query — no control-plane call, so normal inspect never resumes a paused sandbox or extends its TTL.
  • Terminal (standalone /dashboard/terminal and the sandbox-details terminal tab) runs create/connect through a sandbox.openTerminal tRPC mutation; the client receives only scoped envd creds.
  • Explicit "Resume sandbox" (inspector) runs its control-plane connect through a new sandbox.resume tRPC mutation server-side (short SANDBOX_RESUME_TIMEOUT_MS window), then the client rebuilds the envd-only client from the returned creds. Resuming/extending TTL only happens on this explicit user action — never implicitly.
  • Tokenless (secure: false) sandboxes are supported end-to-end (envdAccessToken optional throughout).
  • The now-unused sandbox-management-auth modules are deleted; getFullInfo/sandbox.details remain read-only.

Pause / resume

  • Pause button next to Kill in the sandbox header, shown only while the sandbox is running, with a confirmation popover.
  • Pause runs server-side through a new sandbox.pause tRPC mutation using the SDK's Sandbox.pause() (consistent with sandbox.resume), so the account token stays off the client.
  • Resume a paused sandbox from the inspect Filesystem/Terminal empty states; the empty state notes the resume window and that the sandbox can be paused again.
  • Prevents auto-resume on pause: a memory: true pause can be auto-resumed by any envd traffic, so the live terminal PTY and filesystem watcher would immediately revive the sandbox. The Pause button now optimistically flips state to paused, tearing those connections down before the snapshot runs (rolled back on error); without the live traffic the snapshot completes and the sandbox stays paused.
  • Shows the "Sandbox Paused" empty state for both the filesystem and terminal instead of stale cached data.

Notes

  • Transport choice: terminal/resume/pause mutations use tRPC (vanilla client injected into the helpers), consistent with the sandbox router's existing queries.
  • origin/main merged in — reconciled main's resume UX and details terminal tab onto the no-token model, and adopted main's removal of the client-side attach-retry.

Tests: typecheck/lint clean on changed files.

Not yet verified live: on the preview deploy, (1) confirm via DevTools that envd requests carry only X-Access-Token — no account Authorization: Bearer / X-Supabase-Token (incl. an explicit Resume and a secure: false sandbox); (2) from the Terminal and Filesystem tabs, click Pause and confirm the sandbox shows the paused screen and does not auto-resume.

Known follow-up (pre-existing, from main): sandbox.killTerminalPty still does a control-plane Sandbox.connect (resume + extend TTL) just to kill a PTY on terminal close; it should become envd-only to avoid that side effect.

Before After
CleanShot 2026-06-25 at 19 44 57@2x CleanShot 2026-06-25 at 19 45 13@2x
CleanShot 2026-06-25 at 19 45 55@2x CleanShot 2026-06-25 at 19 46 30@2x

@linear-code

linear-code Bot commented Jun 10, 2026

Copy link
Copy Markdown

ENG-4287

@cla-bot cla-bot Bot added the cla-signed label Jun 10, 2026
@vercel

vercel Bot commented Jun 10, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
web Ready Ready Preview, Comment Jun 25, 2026 5:39pm
web-juliett Error Error Jun 25, 2026 5:39pm

Request Review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aee0b2cd40

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/core/server/api/routers/sandbox.ts Outdated
Comment thread src/features/dashboard/sandbox/inspect/context.tsx Outdated
huv1k added a commit that referenced this pull request Jun 16, 2026
…+ inspector

Addresses Codex P2 feedback on #381. The defensive guards required an
envdAccessToken (and non-null domain), but those are legitimately absent for
secure:false sandboxes (envd is reachable without X-Access-Token) and domain
can be null (the SDK falls back to the configured E2B domain) — regressing
support that worked before this PR.

- create-envd-sandbox: envdAccessToken is now optional.
- sandbox.openTerminal: stop throwing when getFullInfo has no envdAccessToken;
  pass it through as-is.
- inspect context: drop the !envdAccessToken || !domain early-return (keep the
  killed-state narrowing guard).
- test: cover connecting to a tokenless sandbox.
@huv1k

huv1k commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Addressed both Codex P2 comments in 37592de: envdAccessToken is now optional end-to-end, so secure: false sandboxes (envd reachable without X-Access-Token) and sandboxes with a null domain connect for both the terminal and the filesystem inspector. Removed the over-strict guards and added a tokenless-sandbox unit test.

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37592def2a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/core/server/api/routers/sandbox.ts Outdated
… (ENG-4287)

Move sandbox control-plane state management to the server so account tokens
never reach the browser, and add the ability to pause and resume a sandbox
from the dashboard.

- Run sandbox connect/resume and pause server-side via tRPC mutations using
  the E2B SDK; the terminal and inspector use sandbox-scoped envd credentials.
- Add a Pause button next to Kill in the sandbox header, shown only while the
  sandbox is running, with a confirmation popover.
- Resume a paused sandbox from the inspect Filesystem/Terminal empty states,
  noting the resume window and that it can be paused again.
- Prevent a paused sandbox from auto-resuming: optimistically flip state to
  paused so the live terminal PTY and filesystem watcher tear down before the
  pause snapshot runs, instead of keeping the sandbox alive with envd traffic.
- Show the "Sandbox Paused" empty state for both the filesystem and terminal
  instead of stale cached data.
- Support tokenless (secure:false) sandboxes in the terminal and inspector,
  and guard against accidental resume / TTL extension while inspecting.
@huv1k huv1k force-pushed the move-sandbox-state-management-server-side-eng-4287 branch from b78f6d8 to b8db864 Compare June 25, 2026 17:47
@huv1k huv1k marked this pull request as ready for review June 25, 2026 18:54
@huv1k huv1k enabled auto-merge (squash) June 25, 2026 18:54

@ben-fornefeld ben-fornefeld left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@huv1k huv1k merged commit d22b836 into main Jun 25, 2026
14 checks passed
@huv1k huv1k deleted the move-sandbox-state-management-server-side-eng-4287 branch June 25, 2026 18:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8db86413c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +36 to +37
queryClient.setQueryData(detailsKey, (old) =>
old?.state === 'running' ? { ...old, state: 'paused' as const } : old

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent terminal cleanup from re-resuming paused sandboxes

When pausing from the sandbox terminal tab, this optimistic paused cache update unmounts DashboardTerminal immediately; its unmount cleanup calls requestPtyKill, and sandbox.killTerminalPty still performs Sandbox.connect(... timeoutMs: TERMINAL_SANDBOX_TIMEOUT_MS) before killing the PTY. If that keepalive request lands after the pause snapshot completes, it auto-resumes the sandbox and extends it for 30 minutes, so the new Pause action can appear to succeed while leaving an active sandbox running.

Useful? React with 👍 / 👎.

Comment on lines 95 to 99
sandboxConnectRequestTimeoutMs={
shouldResumeSandbox ? SANDBOX_TERMINAL_RESUME_TIMEOUT_MS : undefined
}
sandboxManagementAuth={sandboxManagementAuth}
sandboxScoped
teamSlug={team.slug}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The terminal-tab Resume button gives the sandbox a 30-minute TTL while the empty-state copy on the same screen promises "Runs for 70s" — a ~26× silent TTL extension, exactly the anti-pattern this PR's own inspect/constants.ts comment says it is trying to prevent. The filesystem-tab Resume correctly routes through sandbox.resume (70s TTL); only the terminal-tab Resume bypasses it by going through sandbox.openTerminal (TERMINAL_SANDBOX_TIMEOUT_MS = 30 min). Fix by routing the terminal-tab resume through sandbox.resume first (or by adding a resume mode to sandbox.openTerminal that honors SANDBOX_RESUME_TIMEOUT_MS) before attaching the PTY.

Extended reasoning...

Terminal-tab Resume silently extends sandbox TTL to 30 minutes

What the bug is

SandboxInspectNotFound is rendered on both the filesystem and terminal tabs for paused sandboxes and renders the unconditional message:

// inspect/not-found.tsx
<>
  Resume this sandbox to access the {resourceName}.
  <span className="mt-1 block text-fg-tertiary">
    Runs for {RESUME_TIMEOUT_SECONDS}s, then stops unless the timeout is
    extended. You can pause it again anytime.
  </span>
</>

where RESUME_TIMEOUT_SECONDS = Math.round(SANDBOX_RESUME_TIMEOUT_MS / 1000) = 70. So the user is told the resumed sandbox will run for 70 seconds.

On the filesystem tab this promise is honest: resumeSandbox() in inspect/context.tsx calls trpcClient.sandbox.resume.mutate(...), whose server mutation does Sandbox.connect(... timeoutMs: SANDBOX_RESUME_TIMEOUT_MS /* 70_000 */).

On the terminal tab, the same Resume button takes a completely different path that ignores SANDBOX_RESUME_TIMEOUT_MS entirely.

Code path that triggers it

Step-by-step trace from src/features/dashboard/sandbox/terminal/view.tsx:

  1. Sandbox is paused → SandboxInspectNotFound is shown with the "Runs for 70s" copy.
  2. User clicks Resume sandboxonResumeSandbox runs setShouldResumeSandbox(true).
  3. With shouldResumeSandbox=true, the empty-state branch at lines 67–80 is bypassed and DashboardTerminal is mounted at lines 84–101 with autoStart, launchTarget={sandboxId, template}, and sandboxConnectRequestTimeoutMs={SANDBOX_TERMINAL_RESUME_TIMEOUT_MS} (= 70_000).
  4. autoStart triggers queueTerminalCommandstartTerminalopenTerminalSandboxacquireTerminalSandboxtrpcClient.sandbox.openTerminal.mutate({ teamSlug, template, sandboxId, requestTimeoutMs: 70_000 }).
  5. The openTerminal mutation in src/core/server/api/routers/sandbox.ts hits the if (sandboxId) branch:
const sandbox = await Sandbox.connect(sandboxId, {
  ...connectionOpts,
  timeoutMs: TERMINAL_SANDBOX_TIMEOUT_MS, // 30 * 60 * 1000
  requestTimeoutMs,                       // 70_000 — HTTP timeout only
})

Sandbox.connect's timeoutMs is the SDK's sandbox-TTL parameter — exactly the field the inspect-side comment in constants.ts warns about. requestTimeoutMs is only the HTTP-call timeout, not the sandbox lifetime. So clicking Resume from the terminal tab grants the sandbox a 30-minute TTL.

Why existing code doesn't prevent it

The PR added SANDBOX_RESUME_TIMEOUT_MS and the sandbox.resume mutation specifically to avoid this, with an explicit comment:

Kept short on purpose: resuming for inspect should give a brief debug window, not silently extend a customer sandbox's lifetime.

The filesystem path in inspect/context.tsx calls sandbox.resume. But the terminal tab's Resume button never reaches sandbox.resume — instead it mounts DashboardTerminal, which only knows how to call sandbox.openTerminal, and that mutation hard-codes TERMINAL_SANDBOX_TIMEOUT_MS. There is no resume-aware code path on the terminal side.

Impact

A customer sandbox that was paused (possibly by the new Pause button itself) can be silently revived for 30 minutes of billable lifetime every time a developer opens its terminal tab and clicks the Resume button — while the UI on the very same screen tells them it will run for 70 seconds. This is a 26× discrepancy between the displayed lifetime and the actual lifetime, on the exact action the PR description claims has been hardened against this behaviour.

Step-by-step proof

  1. Open a sandbox-details page for a paused sandbox at /dashboard/{teamSlug}/sandboxes/{sandboxId}/terminal.
  2. The terminal tab renders SandboxInspectNotFound → message: "Resume this sandbox to access the terminal. Runs for 70s, then stops unless the timeout is extended."
  3. Click Resume sandbox.
  4. setShouldResumeSandbox(true) causes DashboardTerminal to mount with launchTarget={sandboxId, template} and sandboxConnectRequestTimeoutMs=70_000.
  5. DashboardTerminal auto-starts → openTerminalSandbox(... sandboxId, requestTimeoutMs: 70_000 ...)trpcClient.sandbox.openTerminal.mutate({ ..., sandboxId, requestTimeoutMs: 70_000 }).
  6. Server-side: Sandbox.connect(sandboxId, { ..., timeoutMs: 30 * 60 * 1000, requestTimeoutMs: 70_000 }) — sandbox is resumed with a 30-minute TTL.
  7. UI promise vs. reality: shown 70s, actual 1800s.

How to fix

Two viable options:

  1. Two-step on the client (preferred — matches filesystem flow): In SandboxTerminalView, when the user clicks Resume, call trpcClient.sandbox.resume.mutate({ sandboxId, requestTimeoutMs: SANDBOX_RESUME_TIMEOUT_MS }) first, then mount DashboardTerminal (which now just attaches a PTY to an already-running sandbox).

  2. Resume-aware openTerminal: Add a mode: 'resume' | 'attach' (or boolean) input to the sandbox.openTerminal mutation; in resume mode pass timeoutMs: SANDBOX_RESUME_TIMEOUT_MS to Sandbox.connect instead of TERMINAL_SANDBOX_TIMEOUT_MS, and have DashboardTerminal plumb the resume flag from SandboxTerminalView.

Option 1 keeps sandbox.resume as the single source of truth for the resume-TTL invariant and makes the terminal-tab flow structurally identical to the filesystem-tab flow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants